[SPARK-48091][SQL] Preserve aliases inside lambda when ExtractGenerator restructures plan #55892
…or restructures plan
ExtractGenerator called trimNonTopLevelAliases on all project list items before extracting the generator. This stripped aliases inside lambda functions (e.g., struct(x.as("data"))) before they could be resolved into struct field names by CreateStruct.
Now only uses trimNonTopLevelAliases for pattern matching to detect generators, but preserves the original untrimmed expression for non-generator project items.
@cloud-fan / @dongjoon-hyun / @sarutak could you please review this PR?
cloud-fan
left a comment
Prior state and problem. When a project list contains a generator (e.g., explode) alongside transform(arr, x => struct(x.as("data"))), the resulting struct field comes out as col1 instead of data. Root cause is a timing interaction: ExtractGenerator runs at resolution-rule position 530, before ResolveFunctions (532), so the inner struct(...) is still UnresolvedFunction("struct", Seq(Alias(x, "data"))). The pre-PR code applied .map(trimNonTopLevelAliases) to the entire project list before pattern matching. trimAliases has a special case for CreateNamedStruct that preserves alias-carried metadata, but that case requires the expression to already be CreateNamedStruct -- while it is still UnresolvedFunction, the generic case other => other.mapChildren(trimAliases) branch descends into the lambda body and strips Alias(x, "data"). The alias is the only carrier of the name "data" at this stage (the Literal("data") field-name slot inside CreateNamedStruct is produced by CreateStruct.apply later, during ResolveFunctions). Once stripped, the resolved form becomes CreateNamedStruct(Seq(Literal("col1"), x)).
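The ordering problem described above can be sketched with a small toy model. This is only an illustration under stated assumptions: the case classes below (`Var`, `Alias`, `UnresolvedFunction`, `CreateNamedStruct`) are hypothetical stand-ins rather than Catalyst's real classes, `resolve` is a simplified imitation of what `CreateStruct` does during `ResolveFunctions`, and the `col1` fallback follows the description in this comment.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst's real expression classes.
object TrimOrderSketch {
  sealed trait Expr
  case class Var(name: String) extends Expr
  case class Alias(child: Expr, name: String) extends Expr
  case class UnresolvedFunction(name: String, args: Seq[Expr]) extends Expr
  case class CreateNamedStruct(fields: Seq[(String, Expr)]) extends Expr

  // Simplified alias trimming: strips every alias it can reach, including
  // aliases that sit under a function call that has not been resolved yet.
  def trimAliases(e: Expr): Expr = e match {
    case Alias(child, _) => trimAliases(child)
    case UnresolvedFunction(n, args) => UnresolvedFunction(n, args.map(trimAliases))
    case CreateNamedStruct(fields) =>
      // Field names are already stored separately, so trimming values loses nothing.
      CreateNamedStruct(fields.map { case (n, v) => (n, trimAliases(v)) })
    case other => other
  }

  // Simplified function resolution: struct(...) takes each field name from the
  // argument's alias and falls back to col1, col2, ... when no alias is present.
  def resolve(e: Expr): Expr = e match {
    case UnresolvedFunction("struct", args) =>
      CreateNamedStruct(args.zipWithIndex.map {
        case (Alias(child, name), _) => (name, child)
        case (other, idx) => (s"col${idx + 1}", other)
      })
    case other => other
  }

  val unresolved: Expr = UnresolvedFunction("struct", Seq(Alias(Var("x"), "data")))

  // Trim before resolution: the alias is gone by the time the struct is built.
  val trimmedFirst: Expr = resolve(trimAliases(unresolved))

  // Resolve before trimming: the name has already been captured.
  val resolvedFirst: Expr = trimAliases(resolve(unresolved))

  def main(args: Array[String]): Unit = {
    println(trimmedFirst)
    println(resolvedFirst)
  }
}
```

In this model the two orderings differ only in the field name: trimming first yields a field named `col1`, while resolving first keeps `data`.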
Design approach. Localized workaround in ExtractGenerator's Project case: trim only for AliasedGenerator pattern detection, and splice the original (untrimmed) e into the new project list. CleanupAliases at end-of-analysis still trims later, after ResolveFunctions has captured the alias name.
Concern -- the fix is local, the bug is in trimAliases. The same upfront .map(trimNonTopLevelAliases) exists in the sibling Aggregate-with-generator branch at Analyzer.scala:3211-3253, with the same case (other, idx) shape that propagates the trimmed other. The same struct-field-name regression is reachable for queries that route through that branch. More generally, the root cause is that trimAliases (and via it trimNonTopLevelAliases) descends into unresolved subtrees and strips aliases whose semantic role has not been determined yet -- UnresolvedFunction("struct", ...) is one case, but the pattern is broader.
Would you consider an alternate fix that addresses the timing issue directly in trimAliases? Adding a leading case e if !e.resolved => e clause would (a) cover the Aggregate path without a parallel edit, (b) leave the existing .map(trimNonTopLevelAliases) call sites in place, and (c) protect future callers that hand trimAliases a partially-resolved tree from the same trap. Curious whether you tried that direction and ran into issues, or whether the local fix was preferred for risk containment.
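As a rough illustration of the suggested guard, here is a toy sketch, not the actual `AliasHelper.trimAliases`. The expression classes and their `resolved` definitions are hypothetical simplifications introduced for this example; only the shape of the leading `case e if !e.resolved => e` clause is taken from the suggestion above.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst's real expression classes.
object ResolvedGuardSketch {
  sealed trait Expr { def resolved: Boolean }
  case class Var(name: String) extends Expr {
    def resolved: Boolean = true
  }
  case class Alias(child: Expr, name: String) extends Expr {
    def resolved: Boolean = child.resolved
  }
  case class UnresolvedFunction(name: String, args: Seq[Expr]) extends Expr {
    def resolved: Boolean = false
  }
  case class Add(left: Expr, right: Expr) extends Expr {
    def resolved: Boolean = left.resolved && right.resolved
  }

  def trimAliases(e: Expr): Expr = e match {
    // Guard from the review: leave anything that is not fully resolved untouched,
    // because the role of its aliases has not been determined yet.
    case u if !u.resolved => u
    case Alias(child, _) => trimAliases(child)
    case Add(l, r) => Add(trimAliases(l), trimAliases(r))
    case other => other
  }

  // The alias under the unresolved struct(...) call survives trimming.
  val unresolvedStruct: Expr =
    UnresolvedFunction("struct", Seq(Alias(Var("x"), "data")))
  val keptAsIs: Expr = trimAliases(unresolvedStruct)

  // A fully resolved tree is still trimmed as before.
  val trimmedResolved: Expr = trimAliases(Add(Alias(Var("a"), "p"), Var("b")))
}
```

One property of this model worth noting: because `resolved` is false for any tree that contains an unresolved node, the guard also skips aliases in the resolved parts of such a tree. Whether that coarseness is acceptable in the real rule is exactly the kind of trade-off the question above is asking about.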
…r workaround Per cloud-fan suggestion, moved the fix from ExtractGenerator to AliasHelper.trimAliases. Added UnresolvedFunction skip case to preserve alias children that carry struct field names. Also fixed ArrayType import nit in test.
Thanks for the suggestion! I tried Narrowed it to |
What changes were proposed in this pull request?
Fix ExtractGenerator to preserve aliases inside lambda functions when restructuring the plan.
Previously, ExtractGenerator called trimNonTopLevelAliases on all expressions in the project list before extracting the generator. This stripped aliases inside lambda functions (e.g., struct(x.as("data"))) before CreateStruct could resolve them into struct field names.
The fix uses trimNonTopLevelAliases only for pattern matching (to detect generators via AliasedGenerator), but preserves the original untrimmed expression for non-generator project items.

Why are the changes needed?
When using explode together with transform in the same select statement, aliases used inside the transformed column's struct() are ignored. Field names become auto-generated (x_1, x_2) instead of the user-specified alias. This only happens with the DataFrame/Dataset API, not with SQL.

Does this PR introduce any user-facing change?
Yes. Struct field aliases inside transform lambdas are now correctly preserved when explode (or any generator) is in the same select.

How was this patch tested?
Added a test in GeneratorFunctionSuite verifying that struct field aliases are preserved when explode and transform are used together, including single and multiple aliases.

Was this patch authored or co-authored using generative AI tooling?
Yes.